Import pandas, PCA, and StandardScaler
PCA is sensitive to the scale of the original features: a feature with a much larger variance than the others will dominate the principal components. StandardScaler standardizes each feature so that its mean is 0 and its variance is 1, which puts all features on equal footing. This preprocessing step is common in ML pipelines.
import pandas as pd
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Load the iris dataset and take a look at it
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['sepal_length','sepal_width','petal_length','petal_width','target'])
df.head()
| | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Split the dataset into features (X) and target (y)
X = df.drop(columns='target')  # positional axis argument (df.drop('target', 1)) was removed in pandas 2.0
y = df.target
Apply the standardization to the X values
X = StandardScaler().fit_transform(X)
print(pd.DataFrame(X))
            0         1         2         3
0   -0.900681  1.032057 -1.341272 -1.312977
1   -1.143017 -0.124958 -1.341272 -1.312977
2   -1.385353  0.337848 -1.398138 -1.312977
3   -1.506521  0.106445 -1.284407 -1.312977
4   -1.021849  1.263460 -1.341272 -1.312977
..        ...       ...       ...       ...
145  1.038005 -0.124958  0.819624  1.447956
146  0.553333 -1.281972  0.705893  0.922064
147  0.795669 -0.124958  0.819624  1.053537
148  0.432165  0.800654  0.933356  1.447956
149  0.068662 -0.124958  0.762759  0.790591

[150 rows x 4 columns]
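As a quick sanity check on what StandardScaler did above, the sketch below standardizes a small hypothetical feature matrix (random numbers standing in for the iris measurements) and verifies that every column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix standing in for the iris measurements
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(150, 4))

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation
X_std = StandardScaler().fit_transform(X_raw)

print(np.allclose(X_std.mean(axis=0), 0))  # True
print(np.allclose(X_std.std(axis=0), 1))   # True
```

Note that StandardScaler uses the population standard deviation (ddof=0), matching NumPy's default `std`.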
Run the principal component analysis model on X
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
pcaDF = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])
pcaDF
| | PC1 | PC2 |
|---|---|---|
| 0 | -2.264542 | 0.505704 |
| 1 | -2.086426 | -0.655405 |
| 2 | -2.367950 | -0.318477 |
| 3 | -2.304197 | -0.575368 |
| 4 | -2.388777 | 0.674767 |
| ... | ... | ... |
| 145 | 1.870522 | 0.382822 |
| 146 | 1.558492 | -0.905314 |
| 147 | 1.520845 | 0.266795 |
| 148 | 1.376391 | 1.016362 |
| 149 | 0.959299 | -0.022284 |
150 rows × 2 columns
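Each principal component is a linear combination of the original four features; the weights (loadings) live in `pca.components_`. The sketch below uses scikit-learn's bundled copy of the iris data (a stand-in for the UCI download above, so numbers may differ slightly) to inspect them:

```python
import numpy as np
from sklearn.datasets import load_iris  # sklearn's bundled copy of iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# components_ holds the loadings: one row per principal component,
# one column per original feature. PC1 for a sample is the dot product
# of row 0 with that sample's standardized features.
print(pca.components_.shape)  # (2, 4)

# The components are orthonormal directions, so each row has unit length
print(np.allclose(np.linalg.norm(pca.components_, axis=1), 1))  # True
```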
Combine pcaDF with y to get a dataframe with both the components and the target
finalDf = pd.concat([pcaDF, y], axis = 1)
finalDf
| | PC1 | PC2 | target |
|---|---|---|---|
| 0 | -2.264542 | 0.505704 | Iris-setosa |
| 1 | -2.086426 | -0.655405 | Iris-setosa |
| 2 | -2.367950 | -0.318477 | Iris-setosa |
| 3 | -2.304197 | -0.575368 | Iris-setosa |
| 4 | -2.388777 | 0.674767 | Iris-setosa |
| ... | ... | ... | ... |
| 145 | 1.870522 | 0.382822 | Iris-virginica |
| 146 | 1.558492 | -0.905314 | Iris-virginica |
| 147 | 1.520845 | 0.266795 | Iris-virginica |
| 148 | 1.376391 | 1.016362 | Iris-virginica |
| 149 | 0.959299 | -0.022284 | Iris-virginica |
150 rows × 3 columns
Plot the principal components against each other, colored by target (you don't need to code this yourselves)
px.scatter(finalDf, x = 'PC1', y='PC2', color = 'target')
Take a look at the explained variance ratios for the two principal components
pca.explained_variance_ratio_
array([0.72770452, 0.23030523])
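The two components together capture roughly 96% of the variance, which is why the 2-D scatter plot separates the species so well. To see how variance accumulates across all four components, you can refit with `n_components=4` and take a cumulative sum. The sketch below uses scikit-learn's bundled iris data, so the exact ratios may differ slightly from the UCI file used above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

# Keep all four components to see the full variance breakdown
pca_full = PCA(n_components=4).fit(X)

print(pca_full.explained_variance_ratio_)            # per-component share
print(np.cumsum(pca_full.explained_variance_ratio_)) # running total, ends at 1.0
```

The ratios are sorted in decreasing order, and with as many components as features they sum to 1, so the cumulative curve tells you how many components you need for any variance target.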